In [11]:
# We pull in the training, validation and test sets created according to the scheme described
# in the data exploration lesson.
import pandas as pd
In [12]:
samtrain = pd.read_csv('../datasets/samsung/samtrain.csv')
samtrain['subject'].unique()
Out[12]:
In [13]:
samval = pd.read_csv('../datasets/samsung/samval.csv')
samval['subject'].unique()
Out[13]:
In [14]:
samtest = pd.read_csv('../datasets/samsung/samtest.csv')
samtest['subject'].unique()
Out[14]:
In [15]:
# We use the random forest implementation from the scikit-learn collection of algorithms.
# The class is sklearn.ensemble.RandomForestClassifier.
# For this we need to convert the target column ('activity') to integer values,
# because the scikit-learn classifier requires numeric class labels.
# In R it would have been a "factor" type and R would have used that for classification.
# We map activity to an integer according to
# laying = 1, sitting = 2, standing = 3, walk = 4, walkup = 5, walkdown = 6.
# The code is in the supporting library randomforests.py.
import randomforests as rf
samtrain = rf.remap_col(samtrain,'activity')
samval = rf.remap_col(samval,'activity')
samtest = rf.remap_col(samtest,'activity')
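The supporting library is not shown here; a rough sketch of what a helper like remap_col might look like is below. The function body and the mapping dict are our illustration, not the actual code in randomforests.py.
In [ ]:
# Illustrative sketch only; not the actual randomforests.py code.
# Maps the activity strings to the integers listed in the comment above.
def remap_col_sketch(df, col):
    activity_map = {'laying': 1, 'sitting': 2, 'standing': 3,
                    'walk': 4, 'walkup': 5, 'walkdown': 6}
    df[col] = df[col].map(activity_map)
    return df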
In [16]:
import sklearn.ensemble as sk
# Note: compute_importances is no longer needed (it has been removed from
# newer versions of scikit-learn); feature_importances_ is always available after fitting.
rfc = sk.RandomForestClassifier(n_estimators=500, oob_score=True)
# drop the spurious index column (position 0) and the last two columns
# ('subject' and the target 'activity')
train_data = samtrain[samtrain.columns[1:-2]]
train_truth = samtrain['activity']
model = rfc.fit(train_data, train_truth)
In [17]:
# use the OOB (out-of-bag) score, which is an estimate of the accuracy of our model.
rfc.oob_score_
Out[17]:
In [32]:
### TRY THIS
# use "feature importance" scores to see what the top 10 important features are
fi = enumerate(rfc.feature_importances_)
cols = samtrain.columns
[(value,cols[i]) for (i,value) in fi if value > 0.04]
## Change the value 0.04 which we picked empirically to give us 10 variables
## try running this code after changing the value up and down so you get more or less variables
## do you see how this might be useful in refining the model?
## Here is the code in case you mess up the line above
## [(value,cols[i]) for (i,value) in fi if value > 0.04]
Out[32]:
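Rather than hand-tuning the threshold, you can also sort the importance scores and take the ten largest directly. This sketch is our addition, not part of the original lesson code; it reuses fi and cols from the cell above.
In [ ]:
# Sketch: rank features by importance and keep the top 10, no threshold needed.
fi_sorted = sorted(fi, key=lambda pair: pair[1], reverse=True)
[(value, cols[i]) for (i, value) in fi_sorted[:10]]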
We apply the model's predict() function to our validation set and our test set, and then analyze the errors in the predictions.
In [19]:
# The pandas DataFrame picks up a spurious index column in position 0, hence we start at column 1.
# We also drop the 'subject' column and the target 'activity' in the last columns, hence the -2
# (i.e. we drop the last 2 columns).
val_data = samval[samval.columns[1:-2]]
val_truth = samval['activity']
val_pred = rfc.predict(val_data)
test_data = samtest[samtest.columns[1:-2]]
test_truth = samtest['activity']
test_pred = rfc.predict(test_data)
In [20]:
print("mean accuracy score for validation set = %f" %(rfc.score(val_data, val_truth)))
print("mean accuracy score for test set = %f" %(rfc.score(test_data, test_truth)))
In [21]:
# use the confusion matrix to see how observations were misclassified as other activities
# See [5]
import sklearn.metrics as skm
test_cm = skm.confusion_matrix(test_truth,test_pred)
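Before plotting, it can help to attach the activity names to the matrix. The label order below is our assumption: confusion_matrix sorts the integer labels 1..6, which correspond to the mapping laying = 1 ... walkdown = 6 used earlier.
In [ ]:
# Sketch: show the confusion matrix as a labeled DataFrame.
# Row/column order assumes the integer mapping laying=1 ... walkdown=6 above.
labels = ['laying', 'sitting', 'standing', 'walk', 'walkup', 'walkdown']
pd.DataFrame(test_cm, index=labels, columns=labels)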
In [22]:
# visualize the confusion matrix
In [23]:
import pylab as pl
pl.matshow(test_cm)
pl.title('Confusion matrix for test data')
pl.colorbar()
pl.show()
We now compute some commonly used measures of prediction "goodness".
For more detail on these measures, see [6], [7], [8], [9].
In [33]:
# Accuracy
print("Accuracy = %f" %(skm.accuracy_score(test_truth,test_pred)))
# Precision (weighted average over the six classes, since this is a multiclass problem)
print("Precision = %f" %(skm.precision_score(test_truth,test_pred,average='weighted')))
# Recall
print("Recall = %f" %(skm.recall_score(test_truth,test_pred,average='weighted')))
# F1 Score
print("F1 score = %f" %(skm.f1_score(test_truth,test_pred,average='weighted')))
Instead of using domain knowledge to reduce variables, use Random Forests directly on the full set of columns, then use variable importance to sort the variables.
Compare the model you get with the model you got from using domain knowledge.
You can short-circuit the data cleanup process as well by simply renaming the variables x1, x2, ..., xn, y, where y is 'activity', the dependent variable.
Now look at the new Random Forest model you get. It is likely to be more accurate at prediction than the one we have above, but it is a black-box model where there is no meaning attached to the variables. A sketch to get you started follows below.
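In this sketch the DataFrame name samsung_full is hypothetical: substitute however you load the full dataset, and adjust the column slicing to its actual layout.
In [ ]:
# Sketch for the exercise above; samsung_full is a hypothetical name for the
# full dataset with all feature columns and 'activity' as the last column.
feature_cols = samsung_full.columns[:-1]
rfc_full = sk.RandomForestClassifier(n_estimators=500, oob_score=True)
rfc_full.fit(samsung_full[feature_cols], samsung_full['activity'])
# rank all variables by importance, highest first
sorted(zip(rfc_full.feature_importances_, feature_cols), reverse=True)[:10]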
[1] Original dataset as R data https://spark-public.s3.amazonaws.com/dataanalysis/samsungData.rda
[2] Human Activity Recognition Using Smartphones http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
[3] Android Developer Reference http://developer.android.com/reference/android/hardware/Sensor.html
[4] Random Forests http://en.wikipedia.org/wiki/Random_forest
[5] Confusion matrix http://en.wikipedia.org/wiki/Confusion_matrix
[6] Mean Accuracy http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1054102&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1054102
[7] Precision http://en.wikipedia.org/wiki/Precision_and_recall
[8] Recall http://en.wikipedia.org/wiki/Precision_and_recall
[9] F Measure http://en.wikipedia.org/wiki/Precision_and_recall
In [29]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)
css_styling()
Out[29]: